About EDA
Exploratory Data Analysis (EDA) is the process of conducting a first
pass at a new or poorly understood set of data with the goal of
exploring the shape, structure, and contents of the dataset.
EDA helps analysts and data scientists understand the scope of the
data and how much work will likely be required to clean and transform
the dataset into a usable and useful format. The purpose of EDA is not
necessarily to answer questions about the content of the dataset, but to
find valuable questions to ask of it. This is accomplished through a
number of techniques ranging from simple summary statistics and data
visualization to relatively sophisticated modelling to understand the
nature of patterns and relationships in the data.
EDA can be accomplished by a number of means; however, language-based
approaches offer inherent repeatability and reproducibility, along with
the option to document the potentially complex processes of data
transformation and cleaning in line with the code. Additionally, work
done through a scripting language is readily extensible for further
development and can form the framework for rapid prototyping and,
eventually, production.
About the R Language
R is a free software environment for statistical computing and
graphics. It compiles and runs on a wide variety of UNIX platforms,
Windows and MacOS. - https://www.r-project.org/
The R language was developed primarily for working with data and
statistical analysis, and as such it is uniquely suited for EDA work. It
is an open source language with a number of packages that extend its
functionality and has an incredibly collaborative and supportive user base.
Relevant Operator Definitions
<- functions as the assignment operator in R and can
be read from right to left as
[This_Variable_Name] <- (assign as) [This_Object]
%>% is the pipe operator in the R Tidyverse and
is used to string together operations. It can be read from left to right
as
[Do this operation] %>% (and then) [Do this other operation]
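A minimal sketch illustrating both operators (the magrittr package, which provides the pipe, is loaded directly here; the variable name and values are illustrative only):

```r
library(magrittr)

# Assign a vector of values to the variable scores
scores <- c(80, 95, 70)

# Take scores, then sort it in decreasing order, then keep the top two
scores %>%
  sort(decreasing = TRUE) %>%
  head(2)
# [1] 95 80
```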
library(tidyverse)
library(plotly)
The tidyverse is a metapackage of multiple packages
designed to extend the functionality of the R language and to work well
together.
The plotly package creates client or server side interactive data
visualizations.
bob_ross_df <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-21/bob_ross.csv')
This is a publicly available dataset featured in the weekly Tidy Tuesday R community GitHub repository.
Dataset shape
glimpse(bob_ross_df)
## Rows: 403
## Columns: 27
## $ painting_index <dbl> 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292…
## $ img_src <chr> "https://www.twoinchbrush.com/images/painting282.png"…
## $ painting_title <chr> "A Walk in the Woods", "Mt. McKinley", "Ebony Sunset"…
## $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,…
## $ episode <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4…
## $ num_colors <dbl> 8, 8, 9, 3, 8, 4, 8, 8, 8, 8, 8, 4, 8, 12, 12, 13, 3,…
## $ youtube_src <chr> "https://www.youtube.com/embed/oh5p5f5_-7A", "https:/…
## $ colors <chr> "['Alizarin Crimson', 'Bright Red', 'Cadmium Yellow',…
## $ color_hex <chr> "['#4E1500', '#DB0000', '#FFEC00', '#102E3C', '#021E4…
## $ Black_Gesso <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,…
## $ Bright_Red <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
## $ Burnt_Umber <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Cadmium_Yellow <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
## $ Dark_Sienna <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Indian_Red <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Indian_Yellow <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Liquid_Black <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Liquid_Clear <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Midnight_Black <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Phthalo_Blue <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Phthalo_Green <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
## $ Prussian_Blue <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ Sap_Green <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
## $ Titanium_White <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ Van_Dyke_Brown <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ Yellow_Ochre <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Alizarin_Crimson <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
The dataset comprises 403 rows and 27 columns. Each record appears to be a single episode with descriptive information.
Dataset structure
library(skimr)
skim(bob_ross_df) %>%
arrange(desc(logical.mean))
| Name | bob_ross_df |
| Number of rows | 403 |
| Number of columns | 27 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| logical | 18 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| img_src | 0 | 1 | 49 | 51 | 0 | 403 | 0 |
| painting_title | 0 | 1 | 8 | 27 | 0 | 401 | 0 |
| youtube_src | 0 | 1 | 41 | 42 | 0 | 403 | 0 |
| colors | 0 | 1 | 18 | 261 | 0 | 176 | 0 |
| color_hex | 0 | 1 | 11 | 165 | 0 | 175 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| Titanium_White | 0 | 1 | 0.99 | TRU: 400, FAL: 3 |
| Alizarin_Crimson | 0 | 1 | 0.94 | TRU: 380, FAL: 23 |
| Van_Dyke_Brown | 0 | 1 | 0.92 | TRU: 371, FAL: 32 |
| Cadmium_Yellow | 0 | 1 | 0.86 | TRU: 346, FAL: 57 |
| Yellow_Ochre | 0 | 1 | 0.81 | TRU: 327, FAL: 76 |
| Phthalo_Blue | 0 | 1 | 0.80 | TRU: 323, FAL: 80 |
| Bright_Red | 0 | 1 | 0.80 | TRU: 321, FAL: 82 |
| Midnight_Black | 0 | 1 | 0.79 | TRU: 317, FAL: 86 |
| Sap_Green | 0 | 1 | 0.76 | TRU: 306, FAL: 97 |
| Indian_Yellow | 0 | 1 | 0.72 | TRU: 292, FAL: 111 |
| Dark_Sienna | 0 | 1 | 0.72 | TRU: 290, FAL: 113 |
| Prussian_Blue | 0 | 1 | 0.65 | TRU: 263, FAL: 140 |
| Phthalo_Green | 0 | 1 | 0.29 | FAL: 287, TRU: 116 |
| Black_Gesso | 0 | 1 | 0.25 | FAL: 302, TRU: 101 |
| Burnt_Umber | 0 | 1 | 0.14 | FAL: 348, TRU: 55 |
| Liquid_Clear | 0 | 1 | 0.13 | FAL: 352, TRU: 51 |
| Liquid_Black | 0 | 1 | 0.03 | FAL: 389, TRU: 14 |
| Indian_Red | 0 | 1 | 0.00 | FAL: 402, TRU: 1 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| painting_index | 0 | 1 | 208.71 | 117.90 | 4 | 107.5 | 210 | 310.5 | 411 | ▇▇▇▇▇ |
| season | 0 | 1 | 16.00 | 8.96 | 1 | 8.0 | 16 | 24.0 | 31 | ▇▇▇▇▇ |
| episode | 0 | 1 | 7.00 | 3.75 | 1 | 4.0 | 7 | 10.0 | 13 | ▇▅▇▅▇ |
| num_colors | 0 | 1 | 10.61 | 2.38 | 1 | 9.0 | 11 | 12.0 | 15 | ▁▁▃▇▃ |
Thankfully this dataset has no records with missing information and requires little to no cleaning or modification before analysis can begin. Note that the number of colors used per episode, shown by num_colors in the numeric variables, appears to have a slightly left-skewed distribution.
Further EDA
library(DataExplorer)
bob_ross_df %>%
group_by(season) %>%
summarise(across(c(Black_Gesso:Alizarin_Crimson),
mean)) %>%
ungroup() %>%
plot_correlation(title = 'Color Use per Season Correlation Plot',
theme_config = list(title = element_text(face = 'bold'),
axis.text.x = element_text(angle = 90),
legend.position = 'bottom'))
Fig 1, Color Use per Season Correlation Plot
The DataExplorer package is an incredibly powerful tool for exploring datasets in depth with minimal coding. Here we see one function from the package for creating correlation plots. It looks like there are several possible color and season correlations worth exploring further.
episode_count_df <-
bob_ross_df %>%
count(season, name = 'episodes_per_season') %>%
count(`episodes_per_season`, name = 'number_of_seasons')
episode_count_df
## # A tibble: 1 × 2
## episodes_per_season number_of_seasons
## <int> <int>
## 1 13 31
Bob recorded 31 seasons comprised of 13 episodes each!
library(htmlwidgets)
bob_ross_df %>%
transmute(
episode = paste0('S', season, ':Ep', episode),
srt = row_number(),
num_colors,
painting_title,
img_src
) %>%
plot_ly(
x = ~ fct_reorder(episode, srt),
y = ~ num_colors,
color = I('#1A237E'),
type = 'scatter',
mode = 'lines+markers',
text = ~ paste0(
'Episode: <b>',
episode,
'</b><br>',
'Count of Colors: <b>',
num_colors,
'</b><br>',
'Title: <b>',
painting_title,
'</b><br>',
'<i>click point to open painting</i>'
),
hoverinfo = 'text',
customdata = ~ img_src
) %>%
layout(
title = list(text = '<b>Colors Used per Episode</b>\n<i> hover over points for more info</i>',
x = 0.05),
xaxis = list(title = FALSE),
yaxis = list(title = '<b>Color Count</b>')
) %>%
onRender(
"
function(el, x) {
el.on('plotly_click', function(d) {
var point = d.points[0];
var url = point.data.customdata[point.pointIndex];
window.open(url);
});
}"
)
Fig 2, Color Count per Episode Line Chart
We see an initial upswing in the number of colors used per episode that becomes relatively stable in later seasons. By putting additional information in the hover tooltip, we ease the cognitive burden on the audience. Additionally, simple JavaScript functions can extend the visualization's functionality with options such as the one demonstrated here for linking to the described painting.
Density plot of individual color usage rates by season
library(ggridges)
bob_ross_df %>%
select(season,
Black_Gesso:Alizarin_Crimson) %>%
pivot_longer(cols = !matches('season')) %>%
group_by(season,
name) %>%
summarise(use_rate = mean(value)) %>%
ungroup() %>%
group_by(name) %>%
mutate(name_srt = mean(use_rate)) %>%
ungroup() %>%
ggplot(aes(x = use_rate,
y = fct_reorder(name, name_srt))) +
geom_density_ridges_gradient(aes(fill = after_stat(x)),
scale = 1.5,
color = '#FFFFFF') +
scale_fill_gradient(low = '#C5CAE9',
high = '#1A237E',
name = 'Use Rate') +
labs(title = 'Color Use Rate per Season Distribution',
x = 'Use Rate per Season',
y = NULL)
Fig 3.1, Color Use Distribution Density Ridge Plot
This is a density ridge plot with a gradient fill that shows the distribution of color usage rates per season across all colors. We can see that Titanium White was used heavily, Indian Red infrequently, and Phthalo Blue at highly variable rates from season to season.
colors_per_season_df <-
bob_ross_df %>%
group_by(season) %>%
summarise(across(c(Black_Gesso:Alizarin_Crimson),
~ mean(.))) %>%
ungroup() %>%
pivot_longer(cols = !contains('season')) %>%
mutate(name = str_replace_all(name, '_', ' '))
colors_df <-
bob_ross_df %>%
select(colors,
color_hex) %>%
mutate(across(everything(),
~ str_extract(., "'.*'[,|\\]]"))) %>%
mutate(across(everything(),
~ str_remove_all(., "\\\\r|\\\\n|'|\\]"))) %>%
separate_longer_delim(cols = everything(),
delim = ',') %>%
mutate(across(everything(),
~ str_squish(.))) %>%
distinct() %>%
rename(name = colors)
colors_per_season_df %>%
left_join(colors_df,
na_matches = 'never') %>%
highlight_key(~ name) %>%
plot_ly(x = ~ season,
y = ~ value) %>%
add_bars(
color = ~ I(color_hex),
marker = list(line = list(width = 1,
color = 'black')),
text = ~ paste0(
'Season: <b>',
season,
'</b><br>',
'Color: <b>',
name,
'</b><br>',
'Use Rate: <b>',
round(value, 2),
'</b>'
),
hoverinfo = 'text',
textposition = 'none'
) %>%
layout(barmode = 'stack',
title = list(text = '<b>Color Usage Rate per Season</b>\n<i> hover over bars for more info, click bar to highlight</i>',
x = 0.05),
xaxis = list(title = '<b>Season</b>'),
yaxis = list(title = '<b>Color Usage Rate</b>'))
Fig 3.2, Color Use Rate Stacked Barchart
By using relatively simple regular expressions (regex) supported in R, we can extract the actual colors by their hexadecimal code and map them to their respective bars in this visualization.
library(tidytext)
title_tokens_df <-
bob_ross_df %>%
select(season,
painting_title) %>%
unnest_tokens(input = painting_title,
output = 'word',
drop = FALSE)
title_tokens_df %>%
anti_join(stop_words) %>%
mutate(word = str_remove(word, 's$')) %>%
count(word) %>%
slice_max(order_by = n,
n = 25) %>%
plot_ly(x = ~ n,
y = ~ fct_reorder(word, n)) %>%
add_bars(
color = I('#1A237E'),
text = ~ paste0('Word: <b>', word, '</b><br>',
'Count: <b>', n, '</b>'),
hoverinfo = 'text',
textposition = 'none'
) %>%
layout(
title = list(text = '<b>Frequently Used Words in Painting Title</b>',
x = 0.05),
xaxis = list(title = '<b>Word Count</b>'),
yaxis = list(
title = FALSE,
tickprefix = '<b>',
ticksuffix = '</b> '
)
)
Fig 4, Top Painting Title Term Frequency Barchart
The tidytext package greatly simplifies the steps required for text mining. Here we can easily unnest the painting titles into columns of their discrete words for further analysis.
library(tidylo)
title_lo_top_df <-
title_tokens_df %>%
mutate(word = str_remove(word, 's$')) %>%
count(season,
word) %>%
bind_log_odds(set = season,
feature = word,
n = n) %>%
anti_join(stop_words) %>%
slice_max(order_by = log_odds_weighted,
n = 50)
title_lo_top_df %>%
ggplot(aes(x = log_odds_weighted,
y = fct_reorder(word, log_odds_weighted))) +
geom_col(fill = '#1A237E') +
facet_wrap(vars(season),
scales = 'free') +
theme(axis.title.y = element_blank())
Fig 5.1, Top Topic Weighted Log Odds by Season Trellis Chart
The tidylo package is small but powerful in that it enables the calculation of weighted log odds with uninformative Dirichlet priors. This is an improvement over the tidytext package's Term Frequency-Inverse Document Frequency (TF-IDF) when dealing with small term frequencies. It is worth noting that recent research indicates that including stop words in the initial calculation or in further modelling efforts generates more accurate results; however, it is still good practice to remove them when communicating results to lay audiences to prevent misinterpretation.
title_lo_top_df %>%
group_by(season) %>%
group_by(word) %>%
mutate(word_srt = mean(log_odds_weighted)) %>%
ungroup() %>%
plot_ly(x = ~ season,
y = ~ fct_reorder(word, word_srt)) %>%
add_heatmap(
z = ~ log_odds_weighted,
xgap = 1,
ygap = 1,
text = ~ paste0(
'Season: <b>',
season,
'</b><br>',
'Title Word: <b>',
word,
'</b><br>',
'Wt. Log Odds: <b>',
round(log_odds_weighted, 2),
'</b>'
),
hoverinfo = 'text'
) %>%
layout(title = list(text = '<b>Topic Heatmap</b>',
x = 0.05),
yaxis = list(title = FALSE),
xaxis = list(title = '<b>Season</b>'))
Fig 5.2, Top Topic Weighted Log Odds Heatmap
This heatmap is an alternative visual for showing important topics by season. The advantage here is the ability to follow a topic or season across an axis; the trade-off is a slight decrease in readability.
library(tidymodels)
library(themis)
# prepare dataframe for modelling
model_data_df <-
bob_ross_df %>%
select(painting_title,
Black_Gesso:Alizarin_Crimson) %>%
mutate(has_winter = if_else(str_detect(str_to_lower(painting_title),
'winter'),
'1_winter',
'2_no_winter')) %>%
select(-painting_title) %>%
mutate(across(c(Black_Gesso:Alizarin_Crimson),
~ if_else(., 1, 0)))
glimpse(model_data_df)
## Rows: 403
## Columns: 19
## $ Black_Gesso <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ Bright_Red <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,…
## $ Burnt_Umber <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,…
## $ Cadmium_Yellow <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,…
## $ Dark_Sienna <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Indian_Red <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Indian_Yellow <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ Liquid_Black <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Liquid_Clear <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Midnight_Black <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Phthalo_Blue <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,…
## $ Phthalo_Green <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,…
## $ Prussian_Blue <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Sap_Green <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,…
## $ Titanium_White <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Van_Dyke_Brown <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Yellow_Ochre <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,…
## $ Alizarin_Crimson <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ has_winter <chr> "2_no_winter", "2_no_winter", "2_no_winter", "1_winte…
Creating a separate dataframe of only the outcome we want to predict and the features we would like to use for prediction eases additional downstream transformations. Depending on the final modelling algorithm used, this can be an iterative step.
# create training and testing data splits
set.seed(42)
raw_split <- model_data_df %>%
initial_split(strata = has_winter)
raw_train <- training(raw_split)
raw_test <- testing(raw_split)
raw_split
## <Training/Testing/Total>
## <301/102/403>
Here we see the ~75/25 training/testing split of our data. We want to train and assess the potential model while keeping the testing portion as a holdout to be used only once for final model evaluation.
# create data folds for training and cross-validation assessment
set.seed(123)
raw_folds <- vfold_cv(raw_train,
strata = has_winter)
raw_folds
## # 10-fold cross-validation using stratification
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [270/31]> Fold01
## 2 <split [271/30]> Fold02
## 3 <split [271/30]> Fold03
## 4 <split [271/30]> Fold04
## 5 <split [271/30]> Fold05
## 6 <split [271/30]> Fold06
## 7 <split [271/30]> Fold07
## 8 <split [271/30]> Fold08
## 9 <split [271/30]> Fold09
## 10 <split [271/30]> Fold10
Because this dataset is so small, we will create additional folds of the training data. This is done through v-fold (k-fold) resampling and results in ten folds used to assess and tune the model during training through cross-fold validation.
# create model step recipe
raw_rec <-
recipe(has_winter ~ .,
data = raw_train) %>%
step_upsample(has_winter) %>%
step_dummy(all_nominal_predictors())
# set model algorithm and specs
xgb_spec <-
boost_tree(
trees = tune(),
mtry = tune(),
min_n = tune(),
learn_rate = 0.1
) %>%
set_engine('xgboost') %>%
set_mode('classification')
# create model workflow
xgb_wf <- workflow(raw_rec, xgb_spec)
xgb_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_upsample()
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = tune()
## min_n = tune()
## learn_rate = 0.1
##
## Computational engine: xgboost
The tidymodels metapackage makes it easy to string together the common modelling steps into recipes and workflows that are easy for humans to read and comprehend. Here we have a dataset well suited to decision tree models, so we will use boosted decision trees with the popular XGBoost algorithm. This algorithm has a large number of hyperparameters we could tune to change model performance; here we will use some of the built-in package functionality to tune the major hyperparameters for us on the already-built training folds.
Here we can see the results of the model tuning.
library(finetune)
# turn on parallel processing
doParallel::registerDoParallel()
# Tune the model hyperparameters
set.seed(628)
xgb_raw_rs <-
tune_race_anova(
xgb_wf,
raw_folds,
grid = 20,
control = control_race(verbose_elim = TRUE)
)
plot_race(xgb_raw_rs)
Fig 6.1, ANOVA Race Tuning
show_best(xgb_raw_rs)
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 9 206 19 roc_auc binary 0.878 10 0.0308 Preprocessor1_Model10
## 2 1 465 20 roc_auc binary 0.877 10 0.0310 Preprocessor1_Model01
## 3 14 877 31 roc_auc binary 0.875 10 0.0304 Preprocessor1_Model16
## 4 7 104 16 roc_auc binary 0.874 10 0.0308 Preprocessor1_Model08
## 5 5 379 37 roc_auc binary 0.874 10 0.0334 Preprocessor1_Model05
The race ANOVA tuning function is nice in that, as hyperparameter combinations are shown to be less performant, they are dropped from further evaluation, reducing total time and resources. Here we can see the separate trials and the best performing variants.
xgb_last <-
xgb_wf %>%
finalize_workflow(select_best(xgb_raw_rs, 'roc_auc')) %>%
last_fit(raw_split)
xgb_fit <- extract_fit_parsnip(xgb_last)
final_preds <-
xgb_last %>%
collect_predictions()
final_preds %>%
roc_curve(has_winter, .pred_1_winter) %>%
plot_ly(
x = ~ 1 - specificity,
y = ~ sensitivity,
showlegend = FALSE
) %>%
add_lines(
color = I('#1A237E'),
line = list(width = 4),
text = ~ paste0(
'Specificity: <b>',
round(specificity, 3),
'</b><br>',
'Sensitivity: <b>',
round(sensitivity, 3),
'</b>'
),
hoverinfo = 'text'
) %>%
add_segments(
x = 0,
xend = 1,
y = 0,
yend = 1,
line = list(dash = 'dash'),
color = I('#BDBDBD')
) %>%
layout(xaxis = list(scaleanchor = 'y',
scaleratio = 1),
title = list(text = '<b>ROC curve</b>',
x = 0.05))
Fig 6.2, Final Model Fit AUC Chart
collect_metrics(xgb_last)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.833 Preprocessor1_Model1
## 2 roc_auc binary 0.939 Preprocessor1_Model1
final_preds %>%
conf_mat(has_winter, .pred_class)
## Truth
## Prediction 1_winter 2_no_winter
## 1_winter 11 17
## 2_no_winter 0 74
We can see that our model seems to perform well; however, we would
potentially need to modify it further depending on how the model will be
deployed. Decision thresholds could be customized to decrease the number
of false positives at the cost of increasing false negatives should
absolute certainty be needed when predicting a Winter painting.
Again, these changes should be made using training data during the
assessment stage and prior to this final fit, but they are not shown
here for the sake of demonstration length.
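As a sketch of what such a threshold adjustment might look like, the following uses a toy tibble in place of the real predictions; the 0.75 cutoff and probability values are illustrative assumptions, though the column names follow the yardstick conventions used above:

```r
library(dplyr)

# Toy predictions standing in for final_preds (illustrative values)
preds <- tibble(.pred_1_winter = c(0.90, 0.60, 0.40, 0.10))

# Raise the decision threshold from the default 0.5 to 0.75 so only
# high-confidence predictions are labelled as Winter paintings
threshold <- 0.75

preds %>%
  mutate(.pred_class = if_else(.pred_1_winter >= threshold,
                               '1_winter',
                               '2_no_winter'))
```

Raising the threshold trades false positives for false negatives; the appropriate cutoff should be chosen using the training folds, never the held-out test set.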
library(SHAPforxgboost)
FTR_Shap <-
shap.prep(
xgb_model = extract_fit_engine(xgb_fit),
X_train = bake(
prep(raw_rec),
has_role('predictor'),
new_data = NULL,
composition = 'matrix'
)
)
shap.plot.summary(FTR_Shap) +
labs(title = 'Color Use Impact on Likelihood of Winter Painting') +
theme(title = element_text(face = 'bold'),
axis.text.y = element_text(face = 'bold'))
Fig 6.3, Final Model Fit SHAP Beeswarm Summary Plot
Decision trees are highly useful algorithms when model explainability is a must. SHAP values have recently been gaining popularity as a way to explain not only variable importance and impact on model outputs but also variable interactions. The SHAPforxgboost package gives access to a number of data visualizations for SHAP values.
EDA Results
Over the course of 31 seasons comprised of 13
episodes each, Bob Ross painted a total of 403
paintings!
His subjects ranged from forest cabins to
winter landscape scenes, and he was particularly fond
of painting mountains. The topics he chose to paint
were broad, but occasionally followed predictable patterns, with several
topics appearing across multiple seasons. Season six,
however, seemed to be a turning point, as the number of topics he
painted varied wildly.
Bob’s choice of colors ramped up over the first few seasons, but then
was relatively steady through the full production run. His use of color
seems to be associated with his choice of topic, with an initial model
predicting Winter paintings showing promising early
results.
Recommendations
Further analysis is warranted regarding the predictability of all common
topics given the color usage. Decision thresholds for predictions may
need to be customized, as this is a small dataset with wide-ranging
topics.
A literature review could also be helpful in identifying additional data
regarding his changing choice of topics and slightly evolving use of
colors and styles.
End of Document